Abstract

This report provides an overview of current state-of-the-art deep learning architectures and
optimisation techniques, and uses the ADNI hippocampus MRI dataset as an example to compare
the effectiveness and efficiency of different convolutional architectures on the task of patch-based
3-dimensional hippocampal segmentation, which is important in the diagnosis of Alzheimer’s Disease.
We found that a slightly unconventional "stacked 2D" approach provides much better classification
performance than simple 2D patches without requiring significantly more computational power. We
also examined the popular "tri-planar" approach used in some recently published studies, and found
that it provides much better results than the 2D approaches, at the cost of a moderate increase in
computational power requirement. Finally, we evaluated a full 3D convolutional architecture, and
found that it provides marginally better results than the tri-planar approach, but at the cost of a
very significant increase in computational power requirement.

1 Introduction


Deep learning techniques have been applied to a wide variety of problems in recent years [1] - most
prominently in computer vision [2], natural language processing [3], and computational audio analysis
[4]. In many of these applications, algorithms based on deep learning have surpassed the previous
state-of-the-art performance. At the heart of all deep learning algorithms is the domain-independent idea
of using hierarchical layers of learned abstraction to efficiently accomplish high level tasks [1].

This report first provides an overview of traditional artificial neural network concepts introduced in
the 1980s, before introducing more recent discoveries that made training deep networks practical and
effective. Finally, we present the results of applying multiple deep architectures to the ADNI
hippocampus segmentation problem, and compare their classification performance and computational
power requirements.


2 Traditional Neural Networks 


Artificial neural networks (ANN) are a machine learning technique inspired by and loosely based on 
biological neural networks (BNN). While they are similar in the sense that they both use a large number 
of identical and linked simple computational units to achieve high performance on complex tasks, 
modern ANNs have been so heavily optimized for efficient implementation on electronic computers that 
they bear little resemblance to their biological counterparts. In particular, the time-dependent
integrate-and-fire mechanism in BNNs has been replaced by steady-state values representing frequency of
firing, and most ANNs also have vastly simplified connection architectures that allow for efficient
propagation. Most current ANN architectures don’t allow loops in connections (with the notable
exception of recurrent neural networks, which use loops to model temporal correlations [5]), and also
don’t allow connections to be made and broken during training (with the notable exception of evolving 
architectures based on genetic algorithms [6, 7, 8]). Unless otherwise specified, all further references 
to neural networks (NN) in this report refer to artificial neural networks. 


2.1 Network Architectures 

In a typical neural network, nodes are placed in layers, with the first layer being the input layer, and 
the last layer being the output layer. The input nodes are special in that their outputs are simply the 
value of the corresponding features in the input vector. 

For example, in a classification task that has a 3-dimensional input (x, y, z) and a binary output, one 
possible network design is to have 3 input nodes, and 1 output node. The input and output layers are
usually considered fixed in network design. 

With only an input layer and an output layer, with all input nodes connected to all output nodes, the 
network essentially implements a matrix multiply, or a linear transformation. This type of network
can solve simple problems where the feature space is linearly separable. However, for linearly separable
problems, simpler techniques such as linear regression or logistic regression can usually achieve similar 
performance, with the only difference being training methods. 

Most modern applications of neural networks use one or more hidden layers - layers that sit between
the input layer and the output layer - to allow the network to model non-linearity in the feature space.
The number of hidden layers and the number of hidden nodes in each layer are hyper-parameters that
are not always easy to determine. While some rules of thumb have been proposed, they are still, for
the most part, determined by trial and error. The risk of using too small a network is that it may not
have enough representative power to model all useful patterns in the input (high bias), while the risk
of using too large a network is that it may overfit the data, and start modeling noise in the training
set (high variance). It is usually better to err on the side of larger networks, because many effective
techniques exist to combat overfitting, as will be detailed in later sections of the report. Using network
size to limit overfitting is error-prone, time-consuming, and not very effective.

Methods have been proposed for automatic hyperparameter tuning, such as evolving cascade networks
[9], which train a large network through an iterative process: starting with a minimal network,
each iteration trains a few "candidate" networks that have more nodes in different layers, and
keeps the best. There are also tuning methods based on genetic algorithms, with links turned on and
off by each bit in the genes [6, 7, 8]. However, these methods have not seen widespread adoption, due
to the large increase in training time, and marginal benefits when overfitting is avoided using
methods other than limiting network size.

It has been proven that a network with 1 hidden layer can approximate any continuous (in feature 
space) function to any accuracy, and a network with 2 hidden layers can approximate any function 
to any accuracy [10, 11]. Given infinite computational power, memory, and training set, there is 
theoretically no reason to go above 2 hidden layers. However, as will be explained later in the report, 
it is much more efficient to solve complex problems using a deeper network than one with only 2 
hidden layers. 


2.2 Network Nodes and Activation Functions 

In a neural network, each node (besides input nodes) has one or more scalar inputs, and one output. 
Each link between nodes has a scalar weight, and each node has a bias to shift the point of activation.


$$ y = f\left(\sum_i w_i x_i + b\right) \qquad (1) $$

The output of each node is computed as shown in Equation 1, where the $x_i$ are inputs to the node, the $w_i$
are the weights of the associated links, $b$ is a bias associated with the node, and $f(x)$ is a function
associated with the node, known as the activation function.
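As a minimal sketch of Equation 1, the following Python function computes the output of a single node. The function name `node_output` and the example values are assumptions made for illustration only, and tanh is used merely as one possible activation function.

```python
import math

def node_output(inputs, weights, bias, activation=math.tanh):
    """Compute f(sum_i w_i * x_i + b) for a single node."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Weighted sum of the inputs, shifted by the bias.
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(pre_activation)

# Hypothetical values chosen only for illustration.
y = node_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(y)  # tanh(0.5*1.0 - 0.25*2.0 + 0.0) = tanh(0.0) = 0.0
```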

There are a few activation functions in widespread use. For output nodes in regression networks,
a linear activation function (e.g. y = x) is most commonly used to give these networks a range
of all real numbers. For output nodes in classification networks, the softmax function (exponential
normalization) is often used to transform the outputs into something that can be interpreted as a
probability distribution.
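The softmax function can be sketched as follows. This is an illustrative implementation rather than code from this report. Subtracting the maximum before exponentiating is a common numerical safeguard added here as an assumption, and it does not change the result.

```python
import math

def softmax(values):
    """Exponential normalization: outputs are positive and sum to 1."""
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer inputs for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

Larger inputs receive larger probabilities, and the outputs can be read as a distribution over classes.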

For hidden nodes, the traditional choices are hyperbolic tangent ($y = \tanh(x)$) and the logistic function
($y = \frac{1}{1 + e^{-x}}$). Both of these functions are designed to satisfy 3 conditions -


• Differentiable everywhere 

• Monotonic 

• Non-linear 


It was believed that these properties are essential for an activation function. 

The differentiability property is important because we must be able to take the derivative of the
function at any point during training using gradient-based methods. It is not necessary for networks
trained using non-gradient-based methods such as genetic algorithms.

The monotonicity property is important because if the activation function is not monotonic, it will
introduce additional local minima in the parameter space, and impede training efforts.

Non-linearity is important because otherwise the network will lose the ability to model non-linear 
patterns in the training set. Non-linearity is achieved using saturation in this case, with the hyperbolic 
tangent function saturating at y = -1 and y = 1, and the logistic function saturating at y = 0 and
y = 1.

In practice, although the logistic function is more biologically plausible, hyperbolic tangent usually
allows faster training, since it is approximately linear around 0, which means nodes will not start training in saturation
(which would make training much slower) even if inputs are zero or negative [12].
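The saturation behaviour described in this section can be checked numerically. The sketch below uses the standard definitions of the two functions and their derivatives. The sample points are arbitrary and chosen only for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    s = logistic(x)
    return s * (1.0 - s)

def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2

# Derivatives are largest near 0 and shrink towards 0 in saturation.
for x in (0.0, 2.0, 5.0):
    print(x, tanh_derivative(x), logistic_derivative(x))
```

At x = 0 the hyperbolic tangent has derivative 1 and output 0, while the logistic function has derivative 0.25 and output 0.5. Both derivatives become very small for large |x|.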


2.3 Training Neural Networks 

Training algorithms for neural networks fall into two major categories - gradient-based and
non-gradient-based. This report focuses on gradient-based methods, as they are much more commonly used in
recent times, and usually converge much faster as well.

As mentioned in Section 2.2, each node in the network has a weight associated with each incoming link,
and a scalar bias. The weights and bias of a node are the parameters of the node. If we concatenate
the weights and biases of all nodes in a network into one vector $\theta$, it completely defines the behaviour
of a network (for a given set of hyper-parameters, i.e. network architecture).


$$ y = f(\theta, x) \qquad (2) $$

If the set of hyper-parameters (network architecture) is encoded into a function $f()$, we can define
the output of the network as shown in Equation 2, where y and x are the output and input vectors 
respectively. 

The goal of the training process, therefore, is to find a $\theta$ such that $f(\theta, x)$ approximates the function
we are trying to model. In other words, given a set of inputs and their desired outputs, we are trying
to find a $\theta$ that minimizes the difference between the desired outputs and network outputs, for all
entries in the training set. For that, a measurement of error is needed.

$$ E(\theta, T_s) = \frac{1}{N} \sum_{(x_i, y_i) \in T_s} \left( y_i - g(x_i) \right)^2 \qquad (3) $$


One such error measure is mean-squared-error (MSE), and it is the most commonly used error measure.
This is given in Equation 3, where $T_s$ is the training set, $x_i$ and $y_i$ are the input and desired output
of a training pair, $N$ is the number of entries in the training set, and $g(x_i) = f(\theta, x_i)$ is the network output.
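The mean-squared-error of Equation 3 can be sketched in a few lines. The one-parameter "network" and the toy training set below are hypothetical, and scalar outputs are assumed for simplicity.

```python
def mean_squared_error(network, training_set):
    """Average squared difference between desired and actual outputs."""
    n = len(training_set)
    return sum((y - network(x)) ** 2 for x, y in training_set) / n

# Hypothetical one-parameter "network" g(x) = theta * x and toy training set.
theta = 2.0
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]
error = mean_squared_error(lambda x: theta * x, training_set)
print(error)  # (0 + 0 + 1) / 3, approximately 0.333
```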

Our goal is to minimize $E(\theta, T_s)$ given $T_s$. For very small networks, it may be feasible to do an
exhaustive search to find the point in parameter space where the mean-squared-error is minimized, 
but for networks of reasonable sizes, an exhaustive search is not practical. Gradient descent is the 
most commonly used optimisation algorithm for neural networks. 

In gradient descent, $\theta$ is first initialized to a random point in the parameter space. Weights and biases
are typically drawn in a way that keeps most nodes in the linear region at the beginning of training.
One popular method is to draw from the uniform distribution $\left(-\frac{a}{\sqrt{d_{in}}}, \frac{a}{\sqrt{d_{in}}}\right)$, where $a$ is chosen based
on the shape of the activation function (where it starts to saturate), and $d_{in}$ is the number of inputs
to the node [13].

$$ \Delta\theta = -L \frac{\partial E(\theta, T_s)}{\partial \theta} \qquad (4) $$


After initialization, the gradient descent algorithm performs a walk in the parameter space, guided 
by the gradient of the error surface. In its simplest form, in each iteration, the algorithm takes a 
step in the opposite direction of the gradient, with the step size proportional to the magnitude of the 
gradient, and a fixed learning rate. This is shown in Equation 4, where L is the learning rate, and all 
other symbols are as previously defined. 
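The basic update rule can be sketched as below. The error surface is a hypothetical quadratic whose gradient is known analytically. In a real network the gradient would come from back-propagation instead.

```python
def gradient_descent(gradient, theta, learning_rate, iterations):
    """Repeatedly step against the gradient: theta <- theta - L * dE/dtheta."""
    for _ in range(iterations):
        grad = gradient(theta)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical error surface E(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2,
# with its minimum at (3, -1).
def toy_gradient(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

result = gradient_descent(toy_gradient, theta=[0.0, 0.0],
                          learning_rate=0.1, iterations=200)
print(result)  # close to [3.0, -1.0]
```

With this learning rate, each step shrinks the distance to the minimum by a constant factor. A learning rate that is too large would overshoot and could diverge.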


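$$ \frac{\partial E(\theta, T_s)}{\partial w_{kj}} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial x_j} \frac{\partial x_j}{\partial w_{kj}} \qquad (5) $$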

The partial derivative of the error function for each parameter is as shown in Equation 5 after applying
the chain rule, where $x_j$ is the result of the summation (input to activation function), $y_j$ is the output of
the activation function, and $w_{kj}$ is the weight we are examining.

The only difference between the output layer and hidden layers is that hidden layers do not directly have
an error. However, we can still derive error terms for them by calculating their contribution to the input
to the activation function of each node in the next layer. This is equivalent to another application of the chain rule,
since the input to the activation function is simply the sum of the contributions from each node in the
previous layer. It is convenient to do this propagation backwards, from the final error, multiplying by
the derivative of the activation function each time, before "assigning blame" for the error proportionally
to the weights connecting previous layer nodes to the node we are examining. This is the basis of
back-propagation, and this is why we require the activation function to be differentiable. A detailed
derivation is omitted here for brevity.
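Although the derivation is omitted, the idea can be illustrated with a small sketch. It assumes a network with one tanh hidden layer, a single linear output node, and the squared error of one training pair. All parameter values are hypothetical. The analytic gradient is compared with a finite-difference estimate as a sanity check.

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer followed by one linear output node."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return hidden, output

def backward(x, target, w_hidden, b_hidden, w_out, b_out):
    """Gradients of E = (target - output)^2, obtained with the chain rule."""
    hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
    d_output = -2.0 * (target - output)  # dE/d(output) for the linear node
    grad_w_out = [d_output * h for h in hidden]
    grad_b_out = d_output
    # Propagate the error back through each hidden node's tanh derivative.
    d_hidden = [d_output * w * (1.0 - h * h) for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in d_hidden]
    grad_b_hidden = d_hidden
    return grad_w_hidden, grad_b_hidden, grad_w_out, grad_b_out

# Hypothetical parameters and training pair.
x, target = [0.5, -1.0], 0.3
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
b_hidden = [0.0, 0.1]
w_out = [0.7, -0.5]
b_out = 0.05

grads = backward(x, target, w_hidden, b_hidden, w_out, b_out)

# Finite-difference check on one hidden weight.
def error_with(w00):
    w = [[w00, w_hidden[0][1]], w_hidden[1]]
    _, out = forward(x, w, b_hidden, w_out, b_out)
    return (target - out) ** 2

eps = 1e-6
numeric = (error_with(w_hidden[0][0] + eps)
           - error_with(w_hidden[0][0] - eps)) / (2 * eps)
print(grads[0][0][0], numeric)  # the two values should agree closely
```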

Many variants of gradient descent have been proposed, each with different performance characteristics.
The most popular one is gradient descent with momentum, where instead of using the
gradient alone as the step, we combine it with a fraction of the weight update from the
previous iteration. This allows faster convergence in situations where the gradient is much larger in
some dimensions than others (e.g. down the bottom of a valley) [14].
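A minimal sketch of the momentum update follows. The elongated quadratic "valley", the learning rate, and the momentum coefficient of 0.9 are assumptions chosen for illustration, not values taken from this report.

```python
def momentum_step(theta, velocity, grad, learning_rate, momentum):
    """One update: new step = momentum * previous step - L * gradient."""
    velocity = [momentum * v - learning_rate * g
                for v, g in zip(velocity, grad)]
    theta = [t + v for t, v in zip(theta, velocity)]
    return theta, velocity

# Hypothetical elongated valley E = 0.5 * (100 * a^2 + b^2): the gradient
# is 100 times steeper along the first dimension than along the second.
def valley_gradient(theta):
    return [100.0 * theta[0], theta[1]]

theta, velocity = [1.0, 1.0], [0.0, 0.0]
for _ in range(300):
    theta, velocity = momentum_step(theta, velocity, valley_gradient(theta),
                                    learning_rate=0.01, momentum=0.9)
print(theta)  # both coordinates end up close to 0
```

For comparison, plain gradient descent with the same learning rate would shrink the shallow coordinate by only a factor of 0.99 per step, so it would still be far from the minimum after 300 steps.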

Another common variation is learning rate scheduling - changing the learning rate as training
progresses. The goal is to get to somewhere close to a local minimum quickly, then slow down to avoid
overshooting. This idea is taken further in resilient back-propagation (RPROP), where only the sign
of the gradient is used. In RPROP, each weight has an independent learning rate, which is increased
(usually multiplied by 1.2) if the sign of the gradient has not changed from the previous iteration, and
reduced (usually multiplied by 0.5) if the gradient has changed sign [15]. This allows all weights to
train at close to their optimal learning rate, and eliminates the learning rate parameter that must be
manually tuned in other gradient descent variants. An initial learning rate still needs to be chosen,
but it doesn’t significantly affect training time or result [15].
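The per-weight step adaptation can be sketched as follows. This is a simplified variant written for illustration: it omits refinements found in published RPROP versions, and the step-size bounds `min_step` and `max_step` are assumptions added here to keep the step sizes in a sensible range.

```python
def rprop_step(theta, grad, prev_grad, step_sizes,
               increase=1.2, decrease=0.5, min_step=1e-6, max_step=50.0):
    """One simplified RPROP update using only the sign of each gradient."""
    new_theta, new_steps = [], []
    for t, g, pg, s in zip(theta, grad, prev_grad, step_sizes):
        if g * pg > 0:      # same sign as the previous iteration: speed up
            s = min(s * increase, max_step)
        elif g * pg < 0:    # sign flipped, so the minimum was overshot
            s = max(s * decrease, min_step)
        # Move against the gradient by the per-weight step size.
        if g > 0:
            t -= s
        elif g < 0:
            t += s
        new_theta.append(t)
        new_steps.append(s)
    return new_theta, new_steps

# Hypothetical one-dimensional problem: minimize (theta - 3)^2 from theta = 0.
theta, steps, prev = [0.0], [0.1], [0.0]
for _ in range(100):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, steps = rprop_step(theta, grad, prev, steps)
    prev = grad
print(theta)  # close to [3.0]
```

Only the sign of the gradient influences the update, so the magnitude of the gradient (and hence any global learning rate) plays no role.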


2.4 Regularization 

When the training set is limited in size, as it usually is, it is dangerous to train without any constraint
since the network will eventually start to model noise in the training set, and be too specialized to 
generalize beyond the training set. 


$$ E(\theta, T_s) = \frac{1}{N} \sum_{(x_i, y_i) \in T_s} \left( y_i - f(\theta, x_i) \right)^2 + \lambda |\theta|^2 \qquad (6) $$

One popular way to combat overfitting is regularization - the idea of encouraging weights to have
smaller values. The most common form of regularization is L2 regularization, where the (squared) L2 norm of $\theta$
(the parameter vector) is added to the error function, as shown in Equation 6, where $\lambda$ is a parameter
that controls the strength of regularization, and it needs to be tuned. If $\lambda$ is too low, the network
would overfit. If $\lambda$ is too high, the network would underfit.

Other norms such as L1 and L1/2 are also used, with different effects. They will be explored in the
discussion of deep networks later in the report.
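The effect of L2 regularization on the error and on the gradient can be sketched as follows. The sketch assumes the penalty is the squared L2 norm scaled by the regularization strength, which is a common convention. The function names and values are hypothetical.

```python
def regularized_error(data_error, theta, strength):
    """Add strength * sum(theta_i^2) to an already computed data error."""
    return data_error + strength * sum(t * t for t in theta)

def regularization_gradient(theta, strength):
    """Gradient of the penalty term: 2 * strength * theta_i per parameter."""
    return [2.0 * strength * t for t in theta]

# Hypothetical parameter vector and data error.
theta = [3.0, -4.0]
total = regularized_error(data_error=1.0, theta=theta, strength=0.01)
penalty_grad = regularization_gradient(theta, strength=0.01)
print(total)         # 1.0 + 0.01 * 25, approximately 1.25
print(penalty_grad)  # approximately [0.06, -0.08]
```

The penalty gradient always points away from zero, so each gradient descent step pulls every weight slightly towards zero, in addition to following the data error.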


3 Deep Learning 

3.1 Why Build Deep Networks? 

As mentioned earlier in Section 2.1, a neural network with 2 hidden layers is already theoretically 
a universal function approximator capable of approximating any function, continuous or not, to any 
arbitrary accuracy. In light of that, it may seem pointless to pursue networks with more hidden layers. 

The main benefit of using deep networks is node-efficiency - it is often possible to approximate complex
functions to the same accuracy using a deeper network with far fewer total nodes compared to a
2-hidden-layer network with very large hidden layers. Besides computational benefits, a model with
fewer degrees of freedom (number of parameters) requires a smaller dataset to train [16], and size of
the training set is often a limiting factor in neural network training.

Intuitively, the reason for a smaller and deeper network to be more effective than an equally sized (in 
total nodes) shallower network is that a deep network reduces the amount of redundant work done. 

As an example, suppose we are given bitmap images containing triangles, rectangles, pentagons, and 
hexagons, and our task is to count the number of each shape in each image. In the case of a deep 
network, the first layers can perform the low level task of calculating image gradients and identifying 
lines in the image, and transform it to a more convenient form. Higher layers can then perform 
classification and counting from the simpler representation. 

On the other hand, if we have a shallow network, the low level tasks would have to be performed 
multiple times, since there is little cross-feeding of intermediate results. This would result in a lot of 
redundant work done. 

With very deep networks, it is possible to model functions that work with many layers of abstraction
- for example, classifying the gender of images of faces, or the breed of dogs. It is not practical to
perform these tasks using shallow networks, because the redundant work done is exponential in the
number of layers, and an equivalent shallow network would require exponentially more computational
power, and exponentially larger training sets, neither of which is usually available.


3.2 Vanishing Gradients 

The fact that deeper networks are more computationally efficient has been known for a very long 
time, and deep networks have been attempted as early as 1980 [17]. In 1989, LeCun et al. successfully 
applied a 3-hidden-layer network to ZIP code recognition. However, they were not able to scale to
a higher number of layers or more complex problems due to very slow training, the reason for which was
not understood.

In 1991, Hochreiter identified the problem and called it "vanishing gradients" [18, 19]. Essentially,
as errors are propagated back from the output layer, they are multiplied by the derivative
of the activation function at the point of activation. As soon as the propagation gets to a node in
saturation (where the derivative is close to 0), the error is reduced to the level of noise, and nodes behind
the saturated node train extremely slowly. No effective solution to the problem was found for many
years.

Researchers continued to try to build deep networks, but with often disappointing performance. 


3.2.1 Solution 1: Layer-wise Pre-training 

In 2007, Hinton proposed a solution to the problem, and started the current wave of renewed interest 
in deep learning. The idea is to first train each layer in an unsupervised and greedy fashion, to try 
to identify application-independent features from each layer, before finally training the entire network 
on labeled data [20]. 

The way he implemented the idea is with a generative model, in the form of a restricted Boltzmann 
machine (RBM), where each layer contains a set of binary variables with associated probability
distributions, and each layer is trained to predict the previous layer using an algorithm known as contrastive
divergence [20]. 

Another idea in the same vein is to use a neural network, known as an autoencoder, to reproduce the original input [21], which
allows the reuse of all the neural network techniques that have already been developed (unlike for
RBM).

The idea is to first start with just the input layer and one hidden layer, and train the network (using 
standard back-propagation gradient descent) to produce the original input, essentially training the 
network to model the identity function [21]. If the hidden layer has fewer nodes than the input 
layer, the output of the hidden layer is then a compressed and more abstract representation of the 
input. This process is repeated to train each hidden layer, with the output of the previous layer as 
the input [21]. The final network can then be trained with the actual output layer using standard 
back-propagation from labeled data. This works around the vanishing gradient problem because when 
the final back-propagation is performed, most of the earlier layers are already trained to provide 
application-independent abstractions. 

In modern implementations, some artificial noise is often injected into the input of autoencoders to 
encourage robustness in the autoencoders, by testing their ability to reconstruct clean input from 
partially-corrupted input [21]. These autoencoders are known as denoising autoencoders. Denoising 
autoencoders perform similarly to systems based on RBMs and their stacked variant - Deep Belief 
Networks (DBN) [21]. 


3.2.2 Solution 2: Rectified Linear Activation Units 

In 2011, Glorot et al. proposed a simpler solution to the vanishing gradients problem: simply use
an activation function that does not
reduce the error as it’s propagated back [12].

The proposed function is the rectified linear activation function (ReLU), $y = \max(0, x)$ [12]. See
Figure 1 for a comparison of the three activation functions. 

In the activated region (x > 0), the derivative is 1, and in the deactivated region (x <= 0), the
derivative is 0. Error signals therefore pass through activated nodes unchanged.
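The difference this makes to back-propagated errors can be illustrated with a small sketch. It considers only the activation-function derivatives along a chain of layers and ignores the weights, which is a simplification made for illustration. The pre-activation value of 2.0 is hypothetical.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # 1 in the activated region (x > 0), 0 otherwise.
    return 1.0 if x > 0 else 0.0

depth = 10
pre_activation = 2.0  # hypothetical value shared by every layer
relu_signal = 1.0
tanh_signal = 1.0
for _ in range(depth):
    # Each layer multiplies the error signal by its activation derivative.
    relu_signal *= relu_derivative(pre_activation)
    tanh_signal *= 1.0 - math.tanh(pre_activation) ** 2

print(relu_signal)  # 1.0: the error passes through unchanged
print(tanh_signal)  # many orders of magnitude smaller than 1
```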

Figure 1: Activation functions - red: ReLU, blue: tanh, green: logistic

At first glance it seems like a strange choice for an activation function, for 2 reasons -


• It is not differentiable at zero 

• It is unbounded on the positive side 


The non-differentiability at zero is inconsequential: since neural networks work in real numbers, it is
highly unlikely that the input to a node will be exactly x = 0 at any time. In practice, we define the derivative at
zero to be either 0 or 1, and the choice has no real-world effect.

The unbounded nature of the function on the positive side is more problematic, because it can result
in very large activations in later layers, which may cause numerical problems. This can be solved using
L1 regularization, which not only limits the magnitude of weights, but also enforces sparsity [12].



Figure 2: Example of different norms. (a) Distributed weights: L1/2 norm 2.97, L1 norm 1.00, L2 norm 0.57. (b) Sparse weights: L1/2 norm 1.0, L1 norm 1.0, L2 norm 1.0.

A sparse network is one where for any given input, only a small subset of nodes will be activated.
The fact that L1 and L1/2 regularization encourage sparsity is in contrast to L2 regularization, which
discourages sparsity. L2 regularization encourages the desired output to be constructed from many
small weights (which would have a lower L2 norm) instead of one large weight, which would have a
higher L2 norm, but equal L1 norm, and lower L1/2 norm, as shown in Figure 2. In essence, L1 and L1/2
regularization encourage nodes to be independent from one another, and not co-evolve with others.
This can result in more efficient usage of available nodes, and also has computational advantages in
networks using activation functions with a hard 0 saturation (such as the rectified linear activation)
[12], since if a node’s output is 0, it does not need to be broadcast to nodes in the next layer.
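The relationship between the three norms can be checked with a short sketch. The two weight vectors are hypothetical and are not the ones used in Figure 2. Both have the same L1 norm, but they differ in how the total weight is spread.

```python
def lp_norm(weights, p):
    """(sum_i |w_i|^p)^(1/p); for p < 1 this is not a true norm."""
    return sum(abs(w) ** p for w in weights) ** (1.0 / p)

# Hypothetical weight vectors with the same L1 norm.
distributed = [0.25, 0.25, 0.25, 0.25]
sparse = [1.0, 0.0, 0.0, 0.0]

for name, weights in (("distributed", distributed), ("sparse", sparse)):
    print(name,
          lp_norm(weights, 0.5),  # L1/2
          lp_norm(weights, 1.0),  # L1
          lp_norm(weights, 2.0))  # L2
# distributed: L1/2 = 4.0, L1 = 1.0, L2 = 0.5
# sparse:      L1/2 = 1.0, L1 = 1.0, L2 = 1.0
```

An L2 penalty therefore favours the distributed vector, while an L1/2 penalty favours the sparse one. An L1 penalty alone does not distinguish between these two particular vectors.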

With rectified linear activation, L1 regularization, and no pre-training, Glorot et al. report achieving
slightly superior or similar performance to previous results based on pre-training [12].

One possible reason for the increase in performance is more efficient use of nodes. With a pre-training
system, nodes are trained to extract all discriminating patterns from the input data, even if some
of those patterns are irrelevant for the task at hand. Without pre-training, those irrelevant patterns
would not be encoded, thus freeing up the nodes for patterns that are more important for the task.

However, they also identified one situation where pre-training is still advantageous - for semi-supervised
problems. In a semi-supervised problem, a large training set is available, but only a small subset is 
labeled. In this case, starting with pre-training using the entire dataset before training the whole 
network with the labeled subset can result in a network that generalizes better [12]. 

For purely-supervised problems, most researchers have abandoned the pre-training idea, and adopted 
ReLU + L1 regularization instead.

There is not sufficient research to decide whether ReLU is also superior to traditional activation
functions in shallower networks. Most shallow networks still use hyperbolic tangent or logistic activation,
because they are not affected by the vanishing gradient problem. 


3.3 DropOut 

In 2012, Hinton et al. proposed a technique that greatly improves network performance in cases where 
there is limited training data [22]. The idea stems from the observation that when the training set is 
small, there will be many possible models that will perform well on the training set, but only some 
will perform well on the testing set. 

The traditional solution to this is to train an ensemble of networks from different initializations and/or
subsets of the training set, then combine their outputs to produce the final output (often by averaging, 
or voting) [23]. This approach is known to work well in improving model performance, but it is often 
not computationally feasible in deep learning, where training a single network can take many hours or 
days even on a fast GPU. Many trained neural networks are also used in real-time applications, and 
using an ensemble would make the model many times slower to evaluate. 

Hinton et al.’s proposal, termed "DropOut", is to randomly de-activate 50% of the nodes of a network
on each training iteration. A disabled node does not participate in forward propagation (it
outputs 0), and blocks any error signal from propagating through it during back-propagation.
When training is done, all nodes are re-enabled, but all weights are halved to maintain
the same output range [22].
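The three parts of the procedure (masking during the forward pass, blocking errors during the backward pass, and rescaling weights after training) can be sketched as follows. The function names are hypothetical, and the sketch operates on plain lists of activations rather than a full network.

```python
import random

def dropout_forward(activations, drop_probability=0.5, rng=random):
    """Training-time pass: each node is zeroed with the given probability."""
    mask = [0.0 if rng.random() < drop_probability else 1.0
            for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask

def dropout_backward(errors, mask):
    """Dropped nodes block the error signal during back-propagation."""
    return [e * m for e, m in zip(errors, mask)]

def scale_weights_for_inference(weights, drop_probability=0.5):
    """After training, scale weights by the keep probability (halve for 50%)."""
    keep = 1.0 - drop_probability
    return [w * keep for w in weights]

# Hypothetical hidden-layer activations and error signals.
rng = random.Random(0)
activations = [0.3, 1.2, 0.7, 2.0, 0.1, 0.9]
outputs, mask = dropout_forward(activations, rng=rng)
errors = dropout_backward([1.0] * len(activations), mask)
print(outputs, errors)
print(scale_weights_for_inference([2.0, 4.0]))  # [1.0, 2.0]
```

A new mask is drawn on every training iteration, so each iteration effectively trains a different sub-network that shares weights with all the others.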

They were able to achieve similar results as ensembles of large numbers of networks, with only about 
2x the computational power requirement of a single network. They hypothesized that this is because 
disabling 50% of nodes at random on each iteration forces nodes to independently evolve, instead of 
co-evolving with others. Co-evolution is not optimal because, for example, some nodes can evolve to 
correct mistakes made by other nodes, instead of modeling useful patterns. In DropOut, nodes cannot 
assume other nodes exist, and are forced to be "independently useful" [22].


3.4 Model Compression and Training Set Augmentation 

As mentioned above in Section 3.3, ensembles of networks often achieve higher performance than 
any constituent network, but are computationally infeasible in most real-world applications. This is 
especially true for problems with small labeled datasets, where single networks are likely to overfit. 

Bucilua et al. proposed that for semi-supervised problems, where a large dataset is available but only 
a small subset is labeled, it may be beneficial to train an ensemble (or another slow and accurate 
model) on the labeled subset, and use it to label all data, before finally training a single network (or 
another fast model) on the entire dataset, to reproduce both the original labels and the labels added
by the ensemble [24].

In applications where there is a shortage of both labeled and unlabeled data, the training set can be 
artificially augmented. Training set augmentation is application-specific. For example, in computer 
vision applications where the network should be translationally and rotationally invariant, additional 
training entries can be formed by translating or rotating images from the original training set [25]. 
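A minimal sketch of this kind of augmentation is shown below, using small 2D images stored as lists of rows. The helper names, the choice of one-pixel shifts, and the restriction to a 90 degree rotation are assumptions made for illustration. A real application would choose transformations appropriate to the data.

```python
def translate(image, dx, dy, fill=0):
    """Shift a 2D image (list of rows) by (dx, dy), padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                shifted[nr][nc] = image[r][c]
    return shifted

def rotate90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the original plus a few translated and rotated copies."""
    copies = [image]
    copies += [translate(image, dx, dy)
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    copies.append(rotate90(image))
    return copies

# Hypothetical 2x2 image; each augmented copy keeps the original label.
image = [[1, 2],
         [3, 4]]
augmented = augment(image)
print(len(augmented))  # 6
```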


3.5 Making Deep Nets Shallow 

Continuing on the theme of model compression, in 2014, Ba and Caruana showed that with the help 
of a high performance deep network, a shallow network can be trained to perform much better than 
a similar shallow network that is trained directly on the training set [26]. 

Their algorithm first trains a deep net (or an ensemble of deep nets) on the original training set, and
uses it to provide "extended labels" for all entries in the training set [26]. In the case of classification
problems where the output layer is often softmax, the "extended labels" are the inputs to the softmax
layer [26]. The inputs to the softmax layer are (unnormalized) log probabilities of each class.

Finally, a shallow network can be trained to predict the log probabilities, instead of the original
single-class label. This makes training much easier for the shallow network, because multi-class log
probability labels provide much more information than a single-class label. By modelling the log
probabilities, the shallow network is also mimicking how the deep net (or ensemble) will generalize to
unseen data, which mostly depends on the relative values of log probabilities for classes that are not
the highest [26]. The performance of these shallow networks is much higher than that of networks trained
on the single-class labels.
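The difference between the two kinds of training target can be sketched as follows. The teacher outputs are hypothetical, and squared error is used here as one plausible regression loss for matching the softmax inputs. This report does not specify the loss, so that choice is an assumption.

```python
def one_hot(label, num_classes):
    """Single-class label: only says which class is correct."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def mimic_loss(student_logits, teacher_logits):
    """Mean squared error between student and teacher softmax inputs."""
    n = len(teacher_logits)
    return sum((s - t) ** 2
               for s, t in zip(student_logits, teacher_logits)) / n

# Hypothetical teacher output for one training entry with 3 classes.
teacher_logits = [4.0, 2.5, -3.0]
hard_label = one_hot(0, 3)

# The hard label carries only "class 0". The teacher's softmax inputs also
# say that class 1 is a much more plausible alternative than class 2.
print(hard_label)
print(mimic_loss([4.0, 2.5, -2.0], teacher_logits))  # 1/3, about 0.333
```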

This result is significant because it shows that the reason why shallow networks perform worse than
deep networks is not entirely due to the increase in representative power and flexibility of deep
networks. It is also due to our current training algorithms being sub-optimal for shallow networks, and
if we can develop better training algorithms, we can potentially significantly improve the performance
of shallow networks.

However, the performance of these mimicking shallow networks is still not quite as good as that of the deep
networks or ensembles they are mimicking [26]. Therefore, the option of creating a mimicking shallow
network allows a tradeoff to be made between accuracy and speed.


3.6 Convolutional Neural Networks 

Convolutional neural networks are a neural network architecture that uses extensive weight-sharing 
to reduce the degrees of freedom of models that operate on features that are spatially-correlated [17]. 
This includes 2D and 3D images (and 2D videos, which can be seen as 3D images), but it has also 
very recently been successfully applied to natural language processing [27]. 

Convolutional neural networks are inspired by the observation that for inputs like images (with each 
pixel being an input dimension), many low level operations are local, and they are not position- 
dependent [17]. For example, one operation that is useful in many computer vision applications is 
edge detection. In a fully-connected deep network, the edge detector would have to be separately 
trained for each part of the image, even though they would most likely all arrive at similar results. It 
would be better if a kernel can be trained to do edge detection for the entire image at the same time. 

In their current form, convolutional neural networks are composed of 3 different types of layers -
convolutional, max-pooling, and fully-connected [2]. One typical arrangement is alternating between
convolutional and max-pooling layers, before finishing off with 2 fully-connected hidden layers.

Each convolutional layer has the same dimensions as the input, but each output pixel is only activated
by a region of pixels centered around the same location in the input images. The weights are shared
across all output pixels. In effect, each map in a convolutional layer performs a convolution of the
input images with a learned kernel. 
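
As a rough illustration of weight sharing, the following NumPy sketch slides a single kernel over a
zero-padded image so that the output keeps the input's dimensions. The function name `conv2d_same`,
the zero padding, and the hand-written edge kernel are assumptions made for this example; in a real
network the kernel values are learned. The sketch does not flip the kernel (strictly speaking it is a
cross-correlation), which makes no difference when the kernel is learned.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Apply one shared kernel at every pixel; the output has the input's shape.

    Assumes an odd-sized kernel and zero padding at the borders.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # The same weights are reused for every output location.
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-written vertical edge detector standing in for a learned kernel.
edge_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
image = np.zeros((6, 6))
image[:, 3:] = 1.0  # dark left half, bright right half
response = conv2d_same(image, edge_kernel)
```

The response is non-zero only in the columns next to the dark-to-bright boundary (and at the padded
right border), regardless of which row is inspected.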

Max-pooling layers perform downsampling on the images. One typical downsample factor is 2x2 
(dividing both width and height by 2). While averaging can also be used, empirical results suggest 
that downsampling by taking the maximum in each sub-region gives the best performance in most 
cases [28]. Max-pooling is responsible for summarizing each sub-region, and it gives the network some 
translational and rotational invariance. 
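
A minimal NumPy sketch of 2x2 max-pooling on a single feature map, assuming non-overlapping
sub-regions and that any odd trailing row or column is dropped (`max_pool_2x2` is a name chosen for
this example):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D map by taking the maximum of each 2x2 sub-region."""
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Split each axis into (block index, position within block), then reduce.
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(fmap)  # width and height are both halved
```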

Fully-connected layers are often used as final layers to encode position-dependent information and 
more global patterns. 

Most existing applications of convolutional neural networks are on 2D images, but the idea can also 
be extended to 3D, with 3D images and 3D kernels. This can be used to process actual 3D images (e.g. 
MR images), or videos, using time as the third dimension. 


4 Hippocampus Segmentation 



Figure 3: Hippocampus 

The hippocampus is a component of human brains responsible for committing short-term episodic and 
declarative memory into long-term memory, as well as navigation [29]. Hippocampus segmentation is 
important in the diagnosis of Alzheimer’s disease (AD), as it is one of the components first affected 
by the disease. A reduction in hippocampal volume can be used as a marker for AD diagnosis [30]. 

Humans have 2 hippocampi, shaped like seahorses, as shown in Figure 3. Our goal is to classify each 
voxel in an MR Image as non-hippocampus, left hippocampus, or right hippocampus. We are using 
this problem to evaluate different deep learning techniques for patch-based segmentation. All images 
are labeled by one human expert. Unfortunately, none of the images have been labeled by more than 
1 human expert, so the variance in human labeling cannot be determined. 


4.1 Methodology 


We explore 3 convolutional neural network architectures for patch-based segmentation on the ADNI 
Alzheimer’s MRI dataset [31]. 

For all 3 cases, the pre-processing and post-processing done are identical. 

For all 3 cases, 60% (120) of the images are used as the training set, 20% (40) as the validation set, 
and 20% (40) as the testing set. All patches from a given image are used in only one of the sets. 


4.1.1 Pre-Processing 

Before we begin labeling an image, we first crop it down to the rectangular bounding box of the brain,
so that we can perform masking in normalized coordinates. In the case of the ADNI dataset, all images
are already in the same orientation, so no rotation is needed. 

By going through all images in the training set, we determined that the hippocampi are always 
in the region (0.42 < x < 0.81, 0.30 < y < 0.67, 0.22 < z < 0.80), relative to each dimension of 
the bounding box of their respective brains. We enlarged the region by 0.03 on each side, and use 
(0.39 < x < 0.84, 0.27 < y < 0.70, 0.19 < z < 0.83) as the mask. All voxels outside of the mask are 
automatically classified as non-hippocampus. All training patches are drawn from within the mask. 
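
The conversion from normalized mask coordinates to voxel indices can be sketched as follows. The
function name `mask_bounds`, the (x, y, z) axis order, and the choice to round outward to whole voxels
are assumptions made for illustration; only the normalized limits and the 0.03 margin come from the
text above.

```python
import math

# Normalized hippocampus extent observed on the training set, per axis (x, y, z).
OBSERVED = ((0.42, 0.81), (0.30, 0.67), (0.22, 0.80))
MARGIN = 0.03

def mask_bounds(bbox_shape, observed=OBSERVED, margin=MARGIN):
    """Return one (start, stop) voxel index range per axis of the cropped brain.

    Assumption: limits are rounded outward so the mask never shrinks.
    """
    bounds = []
    for size, (lo, hi) in zip(bbox_shape, observed):
        start = max(0, math.floor((lo - margin) * size))
        stop = min(size, math.ceil((hi + margin) * size))
        bounds.append((start, stop))
    return bounds

# Hypothetical bounding box of 100 x 120 x 90 voxels.
bounds = mask_bounds((100, 120, 90))
```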


4.1.2 Sampling 

It would be dangerous to draw training voxels uniformly at random from within the mask, because 
even within the mask, the vast majority of voxels are non-hippocampus, and hence there would be 
very few positive samples. Another problem is that edge voxels (voxels at the edges between positive 
and negative voxels) would be severely under-represented, even though they will most likely be the 
most difficult voxels to classify. 

Therefore, we draw samples as follows - 


• For 50% of the samples, we keep drawing randomly until we get a voxel where the 5x5x5 bounding 
box around the voxel contains more than 1 class 

• For 25% of the samples, we keep drawing randomly until we get a positive voxel 

• For the remaining 25% of the samples, we keep drawing randomly until we get a negative voxel 


This drawing scheme ensures that none of the important types of voxels are under-represented. The 
biggest downside of this scheme is that it distorts the prior probabilities of each class; possible
solutions to this are discussed later in the report. 
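
The drawing scheme above can be sketched as a simple rejection sampler. This is only an illustration
under stated assumptions: labels are taken to be 0 for non-hippocampus and non-zero for either
hippocampus, `bounds` is a list of (start, stop) index ranges describing the mask, and the names
`is_edge`, `draw_voxel` and `draw_batch` are hypothetical.

```python
import numpy as np

def is_edge(labels, voxel, radius=2):
    """True if the (2*radius+1)^3 box around `voxel` holds more than one class."""
    box = labels[tuple(slice(max(c - radius, 0), c + radius + 1) for c in voxel)]
    return np.unique(box).size > 1

def draw_voxel(labels, bounds, kind, rng):
    """Draw uniformly inside the mask until the voxel satisfies `kind`.

    Loops forever if the mask contains no voxel of the requested kind.
    """
    accept = {
        "edge": lambda v: is_edge(labels, v),
        "positive": lambda v: labels[v] != 0,
        "negative": lambda v: labels[v] == 0,
    }[kind]
    while True:
        voxel = tuple(int(rng.integers(lo, hi)) for lo, hi in bounds)
        if accept(voxel):
            return voxel

def draw_batch(labels, bounds, n, rng):
    """50% edge voxels, 25% positive, 25% negative (n should be divisible by 4)."""
    kinds = ["edge"] * (n // 2) + ["positive"] * (n // 4) + ["negative"] * (n // 4)
    return [(kind, draw_voxel(labels, bounds, kind, rng)) for kind in kinds]
```

Because every sample is obtained by rejection, the class frequencies in a batch are fixed by the
scheme rather than by the image, which is the distortion of prior probabilities mentioned above.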


4.1.3 Convolutional Method 1: Stacked 2D Patches 

The first method we tried is to use a stack of 2D patches around each voxel we want to sample. For 
example, for a patch size of 24 and a layer count of 3, we would extract three 24x24 patches - one 
around the voxel in question, one in parallel and above, and one in parallel and below. 

Each of the layers is given to a 2D convolutional neural network as a separate channel. 

This method gives the network some 3D context around the voxel (in case of stack sizes greater than 1), 
at a relatively low space overhead. However, the network is not convolutional in the third dimension. 
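
A minimal sketch of extracting such a stack with NumPy is given below. It assumes the volume is
indexed as (z, y, x), that layers are stacked along z, that the layer count is odd, and that the voxel
is far enough from the image border; `stacked_patch` is a name chosen for this example. With an even
patch size the voxel cannot sit exactly in the middle, so here it is placed at index `patch_size // 2`.

```python
import numpy as np

def stacked_patch(volume, center, patch_size=24, n_layers=3):
    """Return an array of shape (n_layers, patch_size, patch_size).

    Each layer is a parallel 2D slice and becomes one input channel.
    """
    z, y, x = center
    half = patch_size // 2
    offsets = range(-(n_layers // 2), n_layers // 2 + 1)
    return np.stack([
        volume[z + dz, y - half:y + half, x - half:x + half]
        for dz in offsets
    ])

volume = np.arange(40 ** 3).reshape(40, 40, 40)
patch = stacked_patch(volume, (20, 20, 20))
```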

The network architecture is 20 5x5 kernels in the first convolutional layer, 50 5x5 kernels in the
second convolutional layer, a 1000-node fully-connected layer, and finally a softmax layer for
exponential normalization. No max-pooling is used, since the network performs slightly worse with any
max-pooling. 


4.1.4 Convolutional Method 2: Tri-planar Patches 

The second method we tried is the tri-planar method used by Prasoon et al. and Roth et al. in other 
medical imaging applications [32, 33]. 

For each voxel, we extract 3 square patches around the voxel, perpendicular to each axis. For example, 
if we want a patch size of 24, we would extract a 24x24 patch on the x-y plane centered around the 
voxel in question, and a 24x24 patch on the x-z plane, and another 24x24 patch on the y-z plane. 
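
The tri-planar extraction can be sketched in the same way. As before, the (z, y, x) index order, the
placement of the voxel at index `patch_size // 2`, and the name `triplanar_patches` are assumptions
made for illustration, and the voxel is assumed to be far enough from the image border.

```python
import numpy as np

def triplanar_patches(volume, center, patch_size=24):
    """Return the x-y, x-z and y-z patches through the voxel at `center`."""
    z, y, x = center
    h = patch_size // 2
    xy = volume[z, y - h:y + h, x - h:x + h]
    xz = volume[z - h:z + h, y, x - h:x + h]
    yz = volume[z - h:z + h, y - h:y + h, x]
    return xy, xz, yz

volume = np.arange(40 ** 3).reshape(40, 40, 40)
xy, xz, yz = triplanar_patches(volume, (20, 20, 20))
```

The three patches share only the voxels along their lines of intersection, which is why they are
processed by separate convolutional stacks rather than as channels of one image.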

Since the corresponding pixels from the 3 patches are not spatially correlated in this case, we use a 
network architecture that consists of 2 convolutional layers (20 5x5 kernels and 50 5x5 kernels) for 
each of the 3 patches, with no connections between them until the very end, where we feed all their 
outputs into a 1000-node fully-connected layer for final classification. No max-pooling is used. 


4.1.5 Convolutional Method 3: 3D Patches 

This approach is an intuitive extension of the 2D approach into 3D. For each voxel we want to sample, 
we take a 3D patch with equal length on each side, around the voxel. For a patch size of 24, we would 
extract 24x24x24 patches. 

The network architecture is 20 5x5x5 kernels in the first convolutional layer, 50 5x5x5 kernels in the
second convolutional layer, then a 1000-node fully-connected layer as before. No max-pooling is used. 


4.1.6 Image Labeling 

After the network is trained, to label an image, patches are extracted for every voxel in the mask 
region (in the correct format for the network architecture in use), and the result is used to label the 
voxel. Any voxel outside of the mask region is automatically classified as negative. 


4.1.7 Training 

All network training is done with standard stochastic gradient descent with a batch size of 50 and a 
fixed learning rate of 0.01. 

At the beginning of training, the termination iteration is set to 1 validation period. Validation is
done after every pass through the training set (24,000 patches, or 480 iterations). Every time a
validation score improves on the current best validation score by more than 1% (in error, not
classification rate), the termination iteration is set to twice the current iteration count. This
means training is only terminated if there has been no significant improvement for at least the
second half of the elapsed time. 
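
The termination rule can be sketched as below by replaying a list of validation errors. The function
name `stop_iteration` is hypothetical, and two details are assumptions rather than statements from
the text: the 1% threshold is interpreted as a relative improvement in error, and the best score is
updated on any improvement even when it is too small to extend training.

```python
def stop_iteration(validation_errors, iters_per_validation=480,
                   min_rel_improvement=0.01):
    """Return (iteration at which training stops, best validation error).

    validation_errors[k] is the error measured after the (k+1)-th pass
    through the training set (24,000 patches / batch size 50 = 480 iterations).
    """
    best = float("inf")
    stop_at = iters_per_validation  # initially 1 validation period
    iteration = 0
    for err in validation_errors:
        iteration += iters_per_validation
        if err < best * (1.0 - min_rel_improvement):
            # Significant improvement: allow training to run twice as long.
            stop_at = max(stop_at, 2 * iteration)
        best = min(best, err)
        if iteration >= stop_at:
            break
    return iteration, best

result = stop_iteration([0.20, 0.15, 0.149, 0.149, 0.149, 0.149])
```

In this replay the last significant improvement happens at iteration 960, so training stops at
iteration 1920.
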

Figure 4: Before (a) and after (b) post-processing 


4.1.8 Post-Processing 

Besides comparing the raw output after labeling by the convolutional neural networks, we also want to
see what kind of performance we can get after some simple post-processing to clean up the results. The 
post-processing applied is the same for all 3 convolutional architectures. 

For each labeled image, we first calculate the centroid of all voxels labeled left-hippocampus, and the 
centroid of all voxels labeled right-hippocampus. 

Then, we divide the image into blobs (connected voxels with the same classification), and check the
size of each blob. 

If a blob is smaller than a certain threshold (500 voxels in our case), and the labeling is negative 
(non-hippocampus), it is re-labeled to be the nearest hippocampus (based on centroid). 

If a blob is smaller than the threshold, and the labeling is positive (hippocampus), it is re-labeled as 
negative (non-hippocampus). 

We find that these simple post-processing steps clean up most of the obviously mis-classified voxels, 
as shown in Figure 4. 
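
A compact sketch of these post-processing steps follows. It assumes labels 0, 1 and 2 for
non-hippocampus, left and right hippocampus, 6-connectivity between voxels, and that "nearest" means
the distance from the blob's own centroid to each hippocampus centroid; the names `blobs` and
`clean_labels` are hypothetical. The pure-Python flood fill is written for clarity only; for full-size
images a library routine such as `scipy.ndimage.label` would be a more practical way to find blobs.

```python
from collections import deque
import numpy as np

NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def blobs(labels):
    """Yield (class, voxel list) for every 6-connected same-class blob."""
    seen = np.zeros(labels.shape, dtype=bool)
    for start in np.ndindex(labels.shape):
        if seen[start]:
            continue
        cls, queue, voxels = labels[start], deque([start]), []
        seen[start] = True
        while queue:
            v = queue.popleft()
            voxels.append(v)
            for d in NEIGHBOURS:
                n = (v[0] + d[0], v[1] + d[1], v[2] + d[2])
                inside = all(0 <= n[k] < labels.shape[k] for k in range(3))
                if inside and not seen[n] and labels[n] == cls:
                    seen[n] = True
                    queue.append(n)
        yield int(cls), voxels

def clean_labels(labels, min_size=500):
    """Re-label blobs smaller than min_size, as described in the text."""
    centroids = {c: np.argwhere(labels == c).mean(axis=0)
                 for c in (1, 2) if np.any(labels == c)}
    cleaned = labels.copy()
    for cls, voxels in blobs(labels):
        if len(voxels) >= min_size:
            continue
        idx = tuple(np.array(voxels).T)
        if cls != 0:
            cleaned[idx] = 0  # small positive blob becomes negative
        elif centroids:
            centre = np.mean(voxels, axis=0)
            # Small negative blob becomes the nearest hippocampus.
            cleaned[idx] = min(
                centroids, key=lambda c: np.linalg.norm(centre - centroids[c]))
    return cleaned
```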


4.2 Results 

All timing results in this section are obtained on a single NVIDIA GeForce GTX Titan Black GPU, 
using Theano’s CUDA implementation of convolutional neural networks [34]. 

Our first experiment is to determine the effect of the number of stacked layers on the performance of
the 2D convolutional architecture. As shown in Table 1, there are very clear improvements going from
1 to 3 layers, but no clear improvements beyond 3 layers. We also note that the number of layers has
little impact on speed in terms of time per iteration. 

All labeling results are on one of the images in the testing set, with 1755 positive voxels and 1393947
negative voxels. All false positive and false negative values are after post-processing, and since the
same image is used for every run, the total numbers of positive and negative voxels are fixed and the
false positive and false negative values are directly comparable between runs. The false negative
count is proportional to (100% - recall); the false positive count is proportional to the false
positive rate and, together with the true positive count, determines precision. 
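
For reference, recall and precision can be recovered from the reported voxel counts as sketched below.
The sketch treats left and right hippocampus as a single positive class, which is an assumption, and
`recall_precision` is a name chosen for this example. The example values are those of the 3-layer row
of Table 1.

```python
def recall_precision(n_positive, false_neg, false_pos):
    """Recall and precision from voxel counts on a single labeled image."""
    true_pos = n_positive - false_neg
    recall = true_pos / n_positive
    precision = true_pos / (true_pos + false_pos)
    return recall, precision

# 1755 positive voxels in the test image; 283 false negatives, 513 false positives.
recall, precision = recall_precision(1755, false_neg=283, false_pos=513)
```

With these counts, recall is roughly 0.84 and precision roughly 0.74.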


# Layers   Best Val Perf   Test Perf   False Pos (vxl)   False Neg (vxl)   Iter    T (mins)
1          21.55%          21.38%      666               426               18240   22.15
2          13.29%          14.56%      1173              413               15360   27.67
3          8.76%           9.79%       513               283               25920   30.39
4          11.69%          11.51%      706               446               19680   25.33
5          8.95%           9.64%       490               306               18720   21.99
7          9.34%           9.65%       479               312               23040   27.65
9          9.25%           10.03%      577               292               22080   27.37
11         10.04%          11.23%      583               331               8640    20.37
13         9.23%           10.01%      761               283               28320   34.52

Table 1: Performance using different numbers of layers 

The next experiment is to test different patch sizes, also with the 2D architecture. As shown in Table 2, 
there is little benefit in going beyond 24x24. Moreover, training becomes much slower for larger patch
sizes (in time per iteration). Therefore, the optimal patch size seems to be 24x24. 


Patch Size   Best Val Perf   Test Perf   False Pos (vxl)   False Neg (vxl)   Iter    T (mins)
12           11.33%          12.64%      2341              275               72960   10.93
16           10.20%          10.64%      674               234               47520   12.41
24           8.85%           10.00%      778               278               24480   11.72
32           8.98%           9.86%       647               283               19680   13.72
48           8.96%           9.80%       547               257               22560   27.96

Table 2: Performance using different 2D patch sizes 

Next, we experiment with patch size for the tri-planar architecture, and see similar results, with the 
optimal patch size being 24x24, as shown in Table 3. There are also very significant reductions in 
training speed as patch size is increased. 


Patch Size   Best Val Perf   Test Perf   False Pos (vxl)   False Neg (vxl)   Iter    T (mins)
12           28.74%          31.69%      32446             506               6240    9.10
16           8.74%           10.31%      663               251               33120   96.84
24           7.56%           8.29%       775               95                29760   95.72
32           7.23%           7.99%       838               118               13920   95.58
48           7.45%           8.45%       626               224               2640    243.58

Table 3: Performance using different tri-planar patch sizes 


Finally, we look at patch sizes for the 3D architecture. Unfortunately, we are constrained by available 
GPU memory in this case, and can only use up to 20x20x20 patches. We find that unlike the previous 
two architectures, the 3D architecture performs well even at a very small patch size of 12x12x12, and 
it is not clear whether it actually benefits from having larger patches, as shown in Table 4. 


Patch Size   Best Val Perf   Test Perf   False Pos (vxl)   False Neg (vxl)   Iter    T (mins)
12           7.54%           8.66%       1690              117               12960   47.71
16           6.73%           8.05%       1799              108               7680    66.53
20           7.06%           7.50%       729               127               13920   194.08

Table 4: Performance using different 3D patch sizes 


From the above experiments, we selected 3 configurations to investigate further - 


• 2D 24x24, 7 layers 

• Tri-planar 24x24 

• 3D 20x20x20 


We run each configuration five times to see how consistent they are. The networks are initialized with 
different random seeds each time, but trained on the same training sets. Results are presented in 
Table 5. 


Type   Best Val Perf   Test Perf   False Pos (vxl)   False Neg (vxl)   Iter    T (mins)
2D     8.18%           9.11%       819               178               31680   42.06
2D     8.80%           8.84%       811               235               42240   51.56
2D     8.76%           9.41%       831               207               37920   49.36
2D     8.03%           9.10%       832               172               40800   50.21
2D     8.61%           9.54%       804               221               39840   44.64
Trip   7.59%           8.64%       767               120               17280   78.00
Trip   7.64%           8.96%       751               107               36480   141.55
Trip   7.35%           8.59%       798               106               18720   78.14
Trip   7.39%           8.71%       692               127               17280   100.38
Trip   7.33%           8.46%       822               102               14880   80.51
3D     6.95%           7.28%       806               113               15360   161.67
3D     6.69%           7.34%       724               136               14880   187.57
3D     7.46%           7.71%       743               127               6720    81.50
3D     6.58%           7.54%       737               130               31680   301.54
3D     7.14%           7.21%       714               155               17760   173.73

Table 5: Multiple runs of best configurations with random initialization (2D 24x24 7 layers, tri-planar 
24x24, 3D 20x20x20) 

While the results on the validation and testing sets are reasonably consistent between runs of the 
same configuration, the actual labeling performance on an entire image is much less consistent. This 
may be because the class distribution in the patches used for the training, validation, and testing
sets is not the same as the distribution in an actual image: some models may be better at classifying
certain classes than others, and those classes are more common in an actual image. 

From the perspective of speed-accuracy tradeoffs, the 2D architecture is clearly the least accurate, but 
also the fastest to train at approximately 754 iterations per minute. 

The tri-planar architecture has clearly better performance than 2D, and can be trained at
approximately 221 iterations per minute. 

The 3D architecture performs consistently better than the tri-planar architecture in classifying patches, 
but that advantage does not seem to translate well to image labeling performance, where it seems to 
perform slightly worse. It is also the slowest to train at approximately 95 iterations per minute, with 
the longest run taking more than 5 hours to train. 


5 Conclusion and Future Work 


In this project we investigated the use of three different convolutional network architectures for
patch-based segmentation of the hippocampal region in MR images. We found that the popular tri-planar
approach offers a good tradeoff between accuracy and training time. While the 3D approach performs
marginally better at patch classification, it does not seem to perform as well at labeling an entire
image. This is most likely due to the sampling method altering the prior probabilities of the classes
presented to the training algorithm. If this problem is solved, the 3D approach should perform
marginally better than the tri-planar approach in whole-image labeling as well, but with a much higher
computational power requirement. 

There are many possible avenues for future investigation. For example, there are learning rate
scheduling algorithms available that may significantly shorten training time without affecting the
quality of results, such as Zeiler’s ADADELTA algorithm [35], which assigns an independent learning
rate to each weight depending on how often it is activated, so that while oft-used connections can
have slower learning rates to achieve higher precision, rarely-used connections can still be trained
at a higher learning rate to reduce bias. 
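
A minimal sketch of a per-weight adaptive update in the style of ADADELTA is given below for
illustration. It keeps running averages of squared gradients and squared updates for each weight and
uses their ratio as the step size. The function name `adadelta_step`, the toy objective, and the
hyperparameter values (`rho=0.95`, `eps=1e-6`) are assumptions for this example and were not tuned
for the segmentation task.

```python
import numpy as np

def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA-style update; `state` holds the two running averages."""
    avg_sq_grad, avg_sq_delta = state
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Per-weight step size: RMS of past updates over RMS of past gradients.
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return x + delta, (avg_sq_grad, avg_sq_delta)

# Toy objective f(x) = sum(x^2), whose gradient is 2x.
x = np.array([1.0, -2.0])
state = (np.zeros_like(x), np.zeros_like(x))
for _ in range(100):
    x, state = adadelta_step(x, 2 * x, state)
```

No global learning rate appears in the update, which is what would remove the fixed rate of 0.01 used
in our experiments.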

It may also be beneficial to give the coordinate of each patch (within the mask) to the fully-connected 
layers of the neural networks, along with the prior probability of each class at that coordinate,
determined by statistical analysis on the training set. 

One possible extension to the tri-planar architecture is to include images at multiple scales for each 
plane, similar to the approach applied to traffic sign recognition by Sermanet and LeCun [36]. This 
would give the networks more global context, and if the largest scales are big enough to include the
boundaries of the brain, it may compensate for the lack of positional information in the tri-planar
architecture. This can either be an alternative to, or a complement to, the idea of giving statistical 
coordinate-based prior probabilities to the network. 